This assignment utilises data from Amazon. The nodes in this network are Amazon products, including books, movies, and music. The edges in this network represent hyperlinks from a given product’s landing page to the landing pages of those products most frequently co-purchased with the given product.
The following data files have been used:
graph complete.txt: The edges of the graph in the form from ! to. Each line is an edge, with the origin node and destination node separated by a space. The data set includes 366,987 product nodes and 1,231,400 co-purchase edges.
graph subset rank1000.txt: A subset of the complete network, containing only products with salesrank under 1,000. Each line is an edge where each node is separated by a space. The data set includes 1,355 product nodes and 2,611 co-purchase edges. Note: Multiple products may share the same salesrank in our data, so there are more than 1,000 products with salesrank under 1,000.
graph subset rank1000 cc.txt: The largest connected component in the network of prod-ucts with salesrank under 1,000. Each line is an edge where each node is separated by a space. The data set includes 292 product nodes and 604 co-purchase edges.
id to titles.txt: Maps the integer ids (primary keys) used to identify nodes to the actual names of the products. There are two space-separated felds in this fle: the integer id and the string title.
The raw data are available from the Stanford Network Analysis Project (http://snap.stanford.edu/data/amazon-meta.html) and were collected in summer 2006. The original dataset contains 548,552 records of books, movies, and music sold on Amazon.com, along with product categories, reviews, and information on co-purchased products. The data has been cleaned and altered the data as follows:
graph complete.txt: Removed discontinued products, and removed edges involving prod-ucts for which no metadata was available. That is, only kept only products that had a co-purchase link to another product in the dataset.
graph subset rank1000.txt: In addition to the above, kept only products that had a salesrank between 0 and 1,000, and kept only co-purchase links between items in this reduced set of products.
graph subset rank1000 cc.txt: In addition to the above, we kept only the largest connected component from this graph.
Plot the network using the information in the file graph_subset_rank1000.txt. Note that this is not the complete network, but only a subset of edges between top-ranked products. By visualizing the graph, you get an idea of the structure of the network you will be working on. In addition to plotting, comment on anything interesting you observe.
# Path to the txt files
path <- file.path("data", "graph_subset_rank1000.txt")
# Import the graph_subset_rank1000.txt file
graph_subset_rank1000 <- read.table(path,
header=FALSE,
sep = " ")
head(graph_subset_rank1000)
## V1 V2
## 1 411653 94292
## 2 68951 478494
## 3 236897 265343
## 4 265343 236897
## 5 472765 236897
## 6 153184 172503
# Convert to dataframe
graph_subset_rank1000_df <- as.data.frame(graph_subset_rank1000)
# Convert dataframe to an igraph object
graph_subset_rank1000_ig <- graph_from_data_frame(graph_subset_rank1000_df, directed = FALSE)
Before analysing the network utilising a better configuration, the raw data is displayed in a simple plot
# Plot using standard plot
plot(graph_subset_rank1000_ig)
To better convey insights from the network, an improved style format has been utilsied, with a number of different layout options utilised:
# Black background
par(bg = "black")
# Plot using auto plot
plot(graph_subset_rank1000_ig,
# === Vertex
vertex.color = rgb(0.8,0.4,0.3,0.8),
vertex.frame.color = "white",
vertex.shape="circle",
vertex.size=8,
# === Vertax labels
vertex.label.color="white",
vertex.label.font=2,
vertex.label.cex=0.4,
vertex.label.dist=0,
vertex.label.degree=0,
# === Edge
edge.color="white",
edge.width=4,
edge.arrow.size=1,
edge.arrow.width=1,
edge.lty="solid",
edge.curved=0.3,
# Layout
layout = layout.auto(graph_subset_rank1000_ig))
# Black background
par(bg = "black")
# Plot the graph object in a MDS layout
plot(graph_subset_rank1000_ig,
# === Vertex
vertex.color = rgb(0.8,0.4,0.3,0.8),
vertex.frame.color = "white",
vertex.shape="circle",
vertex.size=8,
# === Vertax labels
vertex.label.color="white",
vertex.label.font=2,
vertex.label.cex=0.4,
vertex.label.dist=0,
vertex.label.degree=0,
# === Edge
edge.color="white",
edge.width=4,
edge.arrow.size=1,
edge.arrow.width=1,
edge.lty="solid",
edge.curved=0.3,
# Layout
layout = layout_with_mds(graph_subset_rank1000_ig))
# Black background
par(bg = "black")
# Plot the graph object in a Tree layout
plot(graph_subset_rank1000_ig,
# === Vertex
vertex.color = rgb(0.8,0.4,0.3,0.8),
vertex.frame.color = "white",
vertex.shape="circle",
vertex.size=8,
# === Vertax labels
vertex.label.color="white",
vertex.label.font=2,
vertex.label.cex=0.4,
vertex.label.dist=0,
vertex.label.degree=0,
# === Edge
edge.color="white",
edge.width=4,
edge.arrow.size=1,
edge.arrow.width=1,
edge.lty="solid",
edge.curved=0.3,
# Layout
layout = layout_as_tree(graph_subset_rank1000_ig))
# Black background
par(bg = "black")
# Plot using layout nicely
plot(graph_subset_rank1000_ig,
# === Vertex
vertex.color = rgb(0.8,0.4,0.3,0.8),
vertex.frame.color = "white",
vertex.shape="circle",
vertex.size=8,
# === Vertax labels
vertex.label.color="white",
vertex.label.font=2,
vertex.label.cex=0.4,
vertex.label.dist=0,
vertex.label.degree=0,
# === Edge
edge.color="white",
edge.width=4,
edge.arrow.size=1,
edge.arrow.width=1,
edge.lty="solid",
edge.curved=0.3,
# Layout
layout = layout_nicely(graph_subset_rank1000_ig))
# Black background
par(bg = "black")
# Plot using layout circle
plot(graph_subset_rank1000_ig,
# === Vertex
vertex.color = rgb(0.8,0.4,0.3,0.8),
vertex.frame.color = "white",
vertex.shape="circle",
vertex.size=8,
# === Vertax labels
vertex.label.color="white",
vertex.label.font=2,
vertex.label.cex=0.4,
vertex.label.dist=0,
vertex.label.degree=0,
# === Edge
edge.color="white",
edge.width=4,
edge.arrow.size=1,
edge.arrow.width=1,
edge.lty="solid",
edge.curved=0.3,
# Layout
layout = layout_in_circle(graph_subset_rank1000_ig))
# Black background
par(bg = "black")
# Plot using layout kamada kawai
plot(graph_subset_rank1000_ig,
# === Vertex
vertex.color = rgb(0.8,0.4,0.3,0.8),
vertex.frame.color = "white",
vertex.shape="circle",
vertex.size=8,
# === Vertax labels
vertex.label.color="white",
vertex.label.font=2,
vertex.label.cex=0.4,
vertex.label.dist=0,
vertex.label.degree=0,
# === Edge
edge.color="white",
edge.width=4,
edge.arrow.size=1,
edge.arrow.width=1,
edge.lty="solid",
edge.curved=0.3,
# Layout
layout = layout.kamada.kawai(graph_subset_rank1000_ig))
In addition to visually displaying the network, there are a number of attributes that can be calculated to describe the network. These include:
# Number of vertices
Gorder1000 <- gorder(graph_subset_rank1000_ig)
paste("The number of vertices is", Gorder1000)
## [1] "The number of vertices is 1355"
# Number of Edges
Size1000 <- gsize(graph_subset_rank1000_ig)
paste("The number of edges is", Size1000)
## [1] "The number of edges is 2611"
# Edge density
ED1000 <- round(edge_density(graph_subset_rank1000_ig),3)
paste("The edge density is", ED1000)
## [1] "The edge density is 0.003"
# Average distance between between vertices
AD1000 <- round(mean_distance(graph_subset_rank1000_ig, directed = FALSE),3)
paste("The average distance between vertices is", AD1000)
## [1] "The average distance between vertices is 8.942"
# Transitivity
T1000 <- round(transitivity(graph_subset_rank1000_ig),3)
paste("The transitivity is", T1000)
## [1] "The transitivity is 0.411"
Observations
Now, use the file graph subset rank1000 cc.txt to plot only the largest connected compo-nent in the above network. You should be able to reuse your code from above on the new data.
# Path to the txt files
path <- file.path("data", "graph_subset_rank1000_cc.txt")
# Import the graph_subset_rank1000_cc.txt file
graph_subset_rank1000_cc <- read.table(path,
header=FALSE,
sep = " ")
# Convert to dataframe
graph_subset_rank1000_cc_df <- as.data.frame(graph_subset_rank1000_cc)
# Convert dataframe to an igraph object
graph_subset_rank1000_cc_ig <- graph_from_data_frame(graph_subset_rank1000_cc_df, directed = FALSE)
Fo this exercise I have used three layouts to assess the largest node in the network:
# Black background
par(bg = "black")
# Plot using layout nicely
plot(graph_subset_rank1000_cc_ig,
# === Vertex
vertex.color = rgb(0.8,0.4,0.3,0.8),
vertex.frame.color = "white",
vertex.shape="circle",
vertex.size=8,
# === Vertax labels
vertex.label.color="white",
vertex.label.font=2,
vertex.label.cex=0.4,
vertex.label.dist=0,
vertex.label.degree=0,
# === Edge
edge.color="white",
edge.width=4,
edge.arrow.size=1,
edge.arrow.width=1,
edge.lty="solid",
edge.curved=0.3,
# Layout
layout = layout_nicely(graph_subset_rank1000_cc_ig))
# Black background
par(bg = "black")
# Plot using layout kamada kawai
plot(graph_subset_rank1000_cc_ig,
# === Vertex
vertex.color = rgb(0.8,0.4,0.3,0.8),
vertex.frame.color = "white",
vertex.shape="circle",
vertex.size=8,
# === Vertax labels
vertex.label.color="white",
vertex.label.font=2,
vertex.label.cex=0.4,
vertex.label.dist=0,
vertex.label.degree=0,
# === Edge
edge.color="white",
edge.width=4,
edge.arrow.size=1,
edge.arrow.width=1,
edge.lty="solid",
edge.curved=0.3,
# Layout
layout = layout.kamada.kawai(graph_subset_rank1000_cc_ig))
# Black background
par(bg = "black")
# Plot using layout kamada kawai
plot(graph_subset_rank1000_cc_ig,
# === Vertex
vertex.color = rgb(0.8,0.4,0.3,0.8),
vertex.frame.color = "white",
vertex.shape="circle",
vertex.size=8,
# === Vertax labels
vertex.label.color="white",
vertex.label.font=2,
vertex.label.cex=0.4,
vertex.label.dist=0,
vertex.label.degree=0,
# === Edge
edge.color="white",
edge.width=4,
edge.arrow.size=1,
edge.arrow.width=1,
edge.lty="solid",
edge.curved=0.3,
# Layout
layout = layout_with_mds(graph_subset_rank1000_cc_ig))
# Number of vertices
Gorder1000_cc <- gorder(graph_subset_rank1000_cc_ig)
paste("The number of vertices is", Gorder1000_cc)
## [1] "The number of vertices is 292"
# Number of Edges
Size1000_cc <- gsize(graph_subset_rank1000_cc_ig)
paste("The number of edges is", Size1000_cc)
## [1] "The number of edges is 604"
# Edge density
ED1000_cc <- round(edge_density(graph_subset_rank1000_cc_ig),3)
paste("The edge density is", ED1000_cc)
## [1] "The edge density is 0.014"
# Average distance between between vertices
AD1000_cc <- round(mean_distance(graph_subset_rank1000_cc_ig, directed = FALSE),3)
paste("The average distance between vertices is", AD1000_cc)
## [1] "The average distance between vertices is 10.305"
# Transitivity
T1000_cc <- round(transitivity(graph_subset_rank1000_cc_ig),3)
paste("The transitivity is", T1000_cc)
## [1] "The transitivity is 0.261"
Observations
The layouts used cleary show a single connected network. There also appear to be a number of key nodes that tie the network together
The rest of the assignment uses the complete graph contained in the file graph complete.txt and the title file id to titles.csv
Plot the out-degree distribution of our dataset (x-axis number of similar products, y-axis number of nodes). That is, for each product a, count the number of outgoing links to another product page b such that a -> b.
# Path to the txt files
path <- file.path("data", "graph_complete.txt")
# Import the graph_complete.txt file
graph_complete <- read.table(path,
header=FALSE,
sep = " ")
# Convert to data frame
graph_complete_df <- as.data.frame(graph_complete)
# Create list of unique nodes across inbound and outbound columns
unique_nodes <- unique(data.frame(V1=unlist(graph_complete_df, use.names = FALSE)))
# Convert to factors
unique_nodes_v <- as.numeric(unique_nodes[["V1"]])
unique_nodes_f <- as.factor(unique_nodes_v)
# Get standalone list of out nodes in V1 and convert to a factor
out_nodes <- graph_complete_df[, 1, drop = FALSE]
out_nodes_v <- as.numeric(out_nodes[["V1"]])
out_nodes_f <- as.factor(out_nodes_v)
# Use table function to determine frequency of outbound nodes against unique list
table_nodes_out <- as.data.frame(table(unique_nodes_f[match(out_nodes_f, unique_nodes_f)]))
summary(table_nodes_out$Freq)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 4.000 3.355 4.000 5.000
# Load windows font calibra
windowsFonts("Calibra" = windowsFont("Calibra"))
# Create RC chart attributes
rc_chartattributes1 <- theme_bw() +
theme(text=element_text(family="Calibra")) +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "gray"),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(color = "black", size = 30, face = "bold"),
plot.subtitle = element_text(color = "gray45", size = 17),
plot.caption = element_text(color = "gray45", size = 10, face = "italic", hjust = 0))
table_nodes_out_chart <- ggplot(data = table_nodes_out) +
geom_histogram(aes(Freq), bins = 6, fill = "turquoise", position = "identity", alpha = 0.4) +
labs(title = "Out-degree distribution of Amazon data set",
subtitle = "The maximum number of outbound conenctions is 5. However, most nodes appear to have 4 outbound connections. \n Some nodes have zero outbound",
caption = "http://snap.stanford.edu/data/amazon-meta.html",
x = "Number of similar products",
y = "Number of nodes") +
scale_x_continuous(labels = comma) +
rc_chartattributes1
table_nodes_out_chart
# Get standalone list of in nodes in V2 and convert to a factor
in_nodes <- graph_complete_df[, 2, drop = FALSE]
in_nodes_v <- as.numeric(in_nodes[["V2"]])
in_nodes_f <- as.factor(in_nodes_v)
# Use table function to determine frequency of outbound nodes against unique list
table_nodes_in <- as.data.frame(table(unique_nodes_f[match(in_nodes_f, unique_nodes_f)]))
summary(table_nodes_in$Freq)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 3.355 4.000 549.000
# Create histogram
table_nodes_in_chart <- ggplot(data = table_nodes_in) +
geom_histogram(aes(Freq), binwidth = 1, fill = "turquoise", position = "identity", alpha = 0.4) +
labs(title = "In-degree distribution of Amazon data set",
subtitle = "The distribution is very different for inbound; some products have over 500 inbound. \n However, a significant number appear to have zero or one connection",
caption = "http://snap.stanford.edu/data/amazon-meta.html",
x = "Number of similar products",
y = "Number of nodes") +
scale_x_continuous(labels = comma) +
rc_chartattributes1
table_nodes_in_chart
Observations
Transform the x-axis of the previous graph to log scale, to get a better understanding of the distribution. Note here that you should have some products with 0 inbound links. This means that using the log of the x-axis will fail since log(0) will not be valid. Due to this, you should replace 0 with 0:1. Comment on what you observe.
# Create duplicate colum on which to undertake transformation
table_nodes_in$Freq2 <- table_nodes_in$Freq
# Replace 0 with 0.1 to avoid log issues
table_nodes_in[table_nodes_in$Freq2 == 0, ] = 0.1
# Check 0.1 in now minimum in summary
summary(table_nodes_in$Freq2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.100 2.000 3.391 4.000 549.000
# Log Frequency column
table_nodes_in$Freqlog <- log(table_nodes_in$Freq2)
# Load windows font calibra
windowsFonts("Calibra" = windowsFont("Calibra"))
# Create RC chart attributes
rc_chartattributes1 <- theme_bw() +
theme(text=element_text(family="Calibra")) +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "gray"),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(color = "black", size = 30, face = "bold"),
plot.subtitle = element_text(color = "gray45", size = 20),
plot.caption = element_text(color = "gray45", size = 10, face = "italic", hjust = 0))
table_nodes_in_log_chart <- ggplot(data = table_nodes_in) +
geom_histogram(aes(Freqlog), binwidth = 1, fill = "turquoise", position = "identity", alpha = 0.4) +
labs(title = "Log In-degree distribution of Amazon data set",
subtitle = "Logging the frequency shows most products are being purchased directly",
caption = "http://snap.stanford.edu/data/amazon-meta.html",
x = "Number of similar products",
y = "Number of nodes") +
scale_x_continuous(labels = comma) +
rc_chartattributes1
table_nodes_in_log_chart
Observations
Compute the average number of inbound co-purchase links, the standard deviation, and the maximum. Comment on the result.
# Average number of inbound
mean(table_nodes_in$Freq)
## [1] 3.390782
# Standard deviation of inbound
sd(table_nodes_in$Freq)
## [1] 5.953466
# Maximum number of inbound
max(table_nodes_in$Freq)
## [1] 549
Observations
Report the names of the 10 products with the most inbound co-purchase links.
# Order by descending
table_nodes_in_sorted <- table_nodes_in[order(table_nodes_in$Freq, decreasing = TRUE), ]
# Create a subset for the top 10
table_nodes_in_sorted_top10 <- head(table_nodes_in_sorted, n=10)
# Change column names
names(table_nodes_in_sorted_top10) <- c("id", "Freq")
# Load product name txt files
id_to_titles <- read_csv("data/id_to_titles.csv")
# Convert to data frame
id_to_titles_df <- as.data.frame(id_to_titles)
# Merge data sets
Top_10_names <- merge(table_nodes_in_sorted_top10[, c("id", "Freq")],
id_to_titles_df[,c("id", "title")])
# Rank by descending order
Top_10_names <- Top_10_names[order(Top_10_names$Freq, decreasing = TRUE), ]
# Maintain order for charting
Top_10_names$title <- factor(Top_10_names$title, levels = Top_10_names$title[order(Top_10_names$Freq)])
# Create bar chart for top 10
Top_10_names_chart <- ggplot(data = Top_10_names) +
geom_bar(aes(x = title, y = Freq), stat="identity", fill = "gold1") +
labs(title = "Top 10 inbound co-purchase",
subtitle = "Laura has the most inbound links in the Top 10 ",
caption = "http://snap.stanford.edu/data/amazon-meta.html",
x = "Product Title",
y = "Number of Inbound Co-Purchase Links") +
scale_y_continuous(labels = comma) +
rc_chartattributes1
Top_10_names_chart + coord_flip()
Observations